ABSTRACT

This paper describes a non-von Neumann architecture which also conforms to
the requirements for VLSI implementation: expanded VLSI architectures. In
expanded VLSI machines, more than O(n) processors are used to solve O(n)
problems, where inexpensive (in terms of silicon area), fast processors have
been added to simplify the processor interconnections. Expanded architectures
are constructed by deriving algorithms which trade many of one type of
operation, like addition, for regularity of data movement. An expanded
architecture for the Discrete Fourier Transform problem is derived. Three
operational components are described, each of which can be implemented in one
(or a few) VLSI chips. Optimal measures for silicon area and processing time
are primary concerns.


INTRODUCTION 


Research in computer architecture in the last decade has been driven largely
by the motivation to overcome the "von Neumann bottleneck." Many computer
architects believe that the next generation of computers will be based on
non-von Neumann architectures capable of exploiting VLSI processors [Tre82].
Some of these new architecture designs tailor the number of processors to
match the size of the problem; e.g., O(n) processors to solve O(n) problems.
Our paper describes a VLSI architecture which has many more than O(n)
processors, where processors have been added to simplify the processor
interconnection requirements. With an expanded number of processors, the goal
is to design with many inexpensive (in terms of silicon area), fast
processors which have simple interconnection requirements.

New advances in VLSI technology have opened up new horizons for the design of
fast, reliable, and efficient special-purpose processors. Since it is now
possible to lay hundreds of thousands of transistors on a single chip,
efficient algorithms for very complex problems can be implemented in VLSI.
These designs have to satisfy several general conditions, such as regular
layout, simple control, modularity, and simple communication. The main
criteria used by researchers to evaluate a VLSI design are the area of the
chip and the time required to solve an instance of the problem. Theoretical
lower bounds on the area and time have been obtained, together with new
designs for many problems which meet these bounds. However, many of these
designs seem to be difficult to implement on single chips because of the
required complexity of the processing components. For example, if we consider
the VLSI design outlined in [Pre83], n multipliers are required on a chip to
compute the Discrete Fourier Transform of n points. However, with current
technology, it is not possible to lay out more than a few multipliers on a
single chip. Hence, the large number of chips which would be needed to solve
any but the most trivially sized problem makes the design impractical. We
consider in this paper a VLSI design which is currently being implemented.
Each processing component can be implemented in one (or a few) chips.

We introduce in the next section a sample expanded architecture. Using
similar derivation techniques, we hope to be able to construct a family of
expanded, very high speed VLSI arithmetic processors. Expanded architectures
are based on algorithms which trade many of one type of operation (like
addition) for regularity of data movement and speed. Since a one digit online
adder is actually quite small, this trade-off is not only feasible but, from
the point of view of being able to construct hardware which is extremely
fast, regular, and has minimal control, very desirable. Expanded
architectures are interesting in that only fundamental data movement
restrictions (like the amount of information which can be transferred across
the boundary of an area) limit performance. That is, even though the
architecture seems to "throw hardware" at the problem, it does achieve the
$\text{Area} \times \text{Time}^2$ lower bound of $n^2$ which holds for any
VLSI processor [ChMo81]. Each component of the expanded VLSI machine can be
implemented in one (or a few) chips.

THE  EXPANDED  VLSI  ARCHITECTURE 

To show the merits of our expanded architecture, we will consider the
Discrete Fourier Transform (DFT) problem. The DFT of the n element vector $x$
is the n element vector $y$ which satisfies $y = Wx$, where $W$ is an
$n \times n$ matrix such that $W_{i,j} = \omega^{ij}$ and $\omega$ is the
n'th primitive root of unity. By reformulating the DFT equation into either
of the forms below, an expanded architecture for the DFT can be derived in a
straightforward manner.

$Y = S_1 D_1 C_1 [S_0 D_0 C_0 X]^T$ (Prime Factors)

or

$Y = S_1 D_1 C_1 [A S_0 D_0 C_0 X]^T$ (Winograd)

where $D_0$, $D_1$, and $A$ are diagonal arrays, $S_0$, $S_1$, $C_0$, and
$C_1$ are arrays whose elements are either -1, 0, or 1, and $X$ and $Y$ are
arrays whose elements in row major form are the elements of $x$ and $y$,
using Good's algorithm with prime factors reduction [KoPa77] or Winograd's
small n algorithm [Win78]. For more background the reader is referred to
[Ja84].
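
As a point of reference for the DFT definition above, the following is a
minimal sketch of the direct matrix form $y = Wx$. It is not the expanded
architecture itself, only the O(n^2) computation that the reformulations are
meant to restructure. The function names and the sign convention chosen for
the root of unity are assumptions made for illustration.

```python
import cmath


def dft_matrix(n):
    """Return the n x n matrix W with W[i][j] = w**(i*j)."""
    # Assumption: w = exp(-2*pi*i/n); the text only requires a primitive
    # n-th root of unity, so the sign convention is a choice made here.
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (i * j) for j in range(n)] for i in range(n)]


def dft(x):
    """Compute y = W x by direct matrix-vector multiplication."""
    n = len(x)
    W = dft_matrix(n)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


# A unit impulse transforms to a vector of ones.
impulse_spectrum = dft([1, 0, 0, 0])
# A constant vector transforms to a single nonzero term.
constant_spectrum = dft([1, 1, 1, 1])
```

The direct form needs n^2 complex multiplications; the factored forms replace
most of them with additions and subtractions plus a small number of scalings.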


The expanded DFT machine is designed by constructing components which can
compute $SX$ (the summation component), $DX$ (the scaling component), and
$X^T$ (the transpose component) and which can be interconnected in an
efficient manner. The expanded machine structure to solve the DFT using the
prime factors reduction is obtained by interconnecting components of these
three types as indicated in Figure 1.


X -> | Summation | -> | Scaling | -> | Summation | ->

                     | Transpose |

  -> | Summation | -> | Scaling | -> | Summation | -> Y


Figure 1. Expanded DFT Machine


To aid in the construction of the VLSI hardware, each of the components
should have, wherever possible, the following properties.

1) The execution rate of each component should be constant and therefore
   independent of the mathematical and physical characteristics of its size.
   As we will see later, this rate can be made quite high.

2) Each component should be constructed by interconnecting in a regular way
   smaller elements of a few different types. The size of the elements should
   be small and independent of the size of the problem. Furthermore, given
   the formal description, the design of a component should be
   straightforward.

3) The size of each component should be realistic and should satisfy VLSI
   circuit density and size constraints.

4) The logical and the physical input/output characteristics of the various
   components should be compatible so as to allow one component to be
   connected to another in a straightforward manner. Furthermore, the number
   of interconnections between components should be realistic.

The Summation Component

The mathematical operation performed by the summation component is given by
$Z = SX$, where

$S = [s_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_1,$

$X = [x_{i,j}],\ 0 \le i < n_1,\ 0 \le j < n_2,$

and

$Z = [z_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_2.$

For the DFT problem, the elements of $S$ are predefined and are either -1, 0,
or 1. We will make use of this fact by building them into the summation
component itself rather than supplying them to the component as the operation
is performed. Being able to tailor the component in this manner will decrease
the complexity and, hence, the size of the hardware needed to perform the
calculation. The summation component will be constructed by interconnecting
smaller elements of only three different types (addition, subtraction, and
delay), which share the same rectangular shape. The area of the summation
component is $O(n_1 n_0)$ and depends only on $n_1$, $n_0$, and the size of
the elements. The area of each element depends only on the precision p of the
integer values being either added or subtracted. Each of these elements can
be represented as shown in Figure 2.


where

$X_{out} = X_{in}$ and

$Z_{out} = Z_{in} + X_{in}$ if $S_{i,j} = 1$

$Z_{out} = Z_{in} - X_{in}$ if $S_{i,j} = -1$

$Z_{out} = Z_{in}$ if $S_{i,j} = 0$

Figure 2. Summation Elements

These  elements  are  interconnected  as  indicated  in  Figure  3  to  form  a  summation 
component. 

Figure 3. Summation Component

Element $S_{i,j}$, $0 \le i < n_0$, $0 \le j < n_1$, is either an addition
element ($S_{i,j} = 1$), a subtraction element ($S_{i,j} = -1$), or a delay
element ($S_{i,j} = 0$). Note that the elements of the input array and the
result array are supplied and generated in an element skewed manner. Assuming
addition takes unit time, the time required to compute $SX$ is
$2(n_1 - 1) + n_2$.
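
The behavior of the summation component can be sketched in software as
follows. This is a functional model only: it reproduces the values of
$Z = SX$ using nothing but addition, subtraction, and pass-through, and it
does not model the element skewed timing of the hardware. The function name
and the sample matrices are illustrative assumptions.

```python
def summation_component(S, X):
    """Compute Z = S X where every entry of S is -1, 0, or 1.

    Position (i, j) of S plays the role of an addition, subtraction, or
    delay element, so no multiplier is required.
    """
    rows, inner = len(S), len(S[0])
    cols = len(X[0])
    Z = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(inner):
            s = S[i][j]
            if s not in (-1, 0, 1):
                raise ValueError("entries of S must be -1, 0, or 1")
            for k in range(cols):
                if s == 1:        # addition element
                    Z[i][k] += X[j][k]
                elif s == -1:     # subtraction element
                    Z[i][k] -= X[j][k]
                # s == 0: delay element, the running sum passes unchanged
    return Z


# Hypothetical 2 x 2 example (a two point butterfly).
Z = summation_component([[1, 1], [1, -1]], [[3, 5], [2, 7]])
# Z == [[5, 12], [1, -2]]
```

Because the entries of $S$ are fixed, the branch on `s` corresponds to a
choice made when the component is laid out, not to a decision made while the
component is running.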

The principal difficulty in constructing the summation component in the
manner just outlined is having to deal with the relatively large number of
inputs and outputs, $2p(n_1 + n_0)$, and the relatively large physical size
of a high speed, full precision adder. This difficulty can be illustrated by
observing that our goal is to build DFT hardware which can handle problems of
size at least 1024 sixteen bit numbers. This does not seem to be feasible
with current or even foreseeable VLSI technology. Furthermore, the required
size of the perimeter, $O(p(n_1 + n_0))$ (as defined by the input/output
requirements of the summation component), does not fit well with its area
requirements, $O(p\,n_1 n_0)$. This comment is based on the observation that
if the perimeter requirement of a component exceeds the square root of the
area requirement, then either area or time is probably wasted. For small
precision, this will not be a problem. However, for large precision it will
be.

The obvious first choice to get around this difficulty would be to share (via
time multiplexing) some smaller number of input and output interconnections
and to share a smaller number of integer adders and subtractors. However,
this approach would suffer not only from the usual problem of having to map a
larger problem to the available hardware; in addition, a given element could
no longer be tailored for the calculation it performs. Hence, we would lose
the size advantage we obtained by tailoring a given element to a given task
(addition, subtraction, or delay). We would also lose the advantage of not
having to supply the elements of $S$. To avoid the above problems, we propose
to use digit online processing [TrEr77, IrOw83], where each value is
transmitted in a digit serial manner. Digit online algorithms for arithmetic
functions generate the j-th digit of the result after having been supplied
with only the first (j + k) digits of the input operands, where k is a small
integer corresponding to the digit online delay. By using digit online
arithmetic, we are able to reduce the perimeter required from
$O(p(n_1 + n_0))$ to $O(n_1 + n_0)$ and the area required from
$O(p\,n_1 n_0)$ to $O(n_1 n_0)$. Note that neither the perimeter nor the area
required depends on p. The only disadvantage of using digit online arithmetic
is that the digit online version of the summation component must be clocked p
times for each time the word parallel version is clocked once. This may not
be as much of a problem as it first appears, since the digit online version
may be clocked at a faster rate than the word parallel version because its
basic elements are simpler (bit adders versus word adders) and, therefore,
faster [Gur84].
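
The input/output saving can be made concrete with a small calculation. The
sketch below evaluates the word parallel count of 2p(n1 + n0) lines quoted
above and, as a simplifying assumption, charges the digit online version one
line per digit stream with the same factor of two. The dimensions used in the
example are purely illustrative and are not taken from the design.

```python
def word_parallel_io_lines(p, n0, n1):
    """Input/output lines of the word parallel summation component."""
    return 2 * p * (n1 + n0)


def digit_online_io_lines(n0, n1):
    """Input/output lines of the digit online version.

    Assumption: one line per digit stream; the wires needed to encode a
    single redundant digit are ignored.
    """
    return 2 * (n1 + n0)


# Illustrative values only: sixteen digit operands, a 32 x 32 component.
p, n0, n1 = 16, 32, 32
wide = word_parallel_io_lines(p, n0, n1)    # 2048
narrow = digit_online_io_lines(n0, n1)      # 128
clock_ratio = p                             # digit online clocks per word clock
```

Under these assumptions the line count falls by a factor of p, which is
exactly the factor by which the number of clock cycles per operation grows.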

We could at this point elect either left directed digit online processing
(working from least significant digit to most significant digit) or right
directed digit online processing (working from most significant digit to
least significant digit), as integer addition and subtraction can be done
either way. If the construction of the summation component were the only
consideration, we would elect to use left directed processing (conventional
serial addition), as its digit online delay is lower (k = 0) and the
complexity of the hardware necessary to perform the required calculations is
smaller. However, as we will see, the scaling component requirements will
force us to use right directed techniques. Algorithms for right directed,
digit online processing have been developed for floating point comparison
exchange, addition, subtraction, and multiplication (k = 1); for floating
point division, square root, and the trigonometric functions of sine and
cosine (k = 3); and for fixed point logarithm and exponentiation (k = 1)
[Owe80, Owe81].


The digit online summation element for $S_{i,j} = 1$ can be represented as
shown in Figure 4, where

$x_{out,j} = x_{in,j}$,

$h_j = x_{in,j} + z_{in,j} - b\,g_j$, where

$g_j$ is chosen so that $-(b-1) < h_j < (b-1)$, and

$z_{out,j-1} = h_{j-1} + g_j$.


Figure 4. Summation Digit Online Element


The flag input is used to notify the element that the first digit of an
operand will be supplied next. The digit values $h_{j-1}$ and $g_j$ are
stored in digit registers in the summation element. The reset flag resets the
value of $g_{p+1} = 0$ to complete the previous operation and resets the
value of $h_0 = 0$ to start the next operation.

The number system we have decided to use in our implementation is octal with
a maximally redundant digit set [Atk75] of
{-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7} instead of the
conventional digit set {0, 1, 2, 3, 4, 5, 6, 7}. However, the digits of
predefined values will be restricted to the minimally redundant set
{-4, -3, -2, -1, 0, 1, 2, 3, 4}. In this number system, three representations
for $28_{10}$ are $0034_8$, $004\bar{4}_8$, and $01\bar{4}\bar{4}_8$, where
$\bar{4} = -4$. As an example of the right directed, digit online addition
operation used in the summation element, consider $X_{in} = 3344_8$ and
$Z_{in} = 3434_8$. Note that this example would cause a full length ripple
carry in a conventional adder. The result generated by the example is
$Z_{out} = 7000_8$. Note that there is a digit online delay of one. The
functional description of the digit online subtraction and delay elements
follows from the description of the addition element.
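
The octal example can be traced with a short simulation. The sketch below is
one way to realize a right directed, digit online adder with a delay of one
digit: at each step a transfer digit g in {-1, 0, 1} is chosen so that the
interim digit h stays strictly inside the range -(b-1) to (b-1), and the
previous interim digit plus the new transfer is emitted. The particular rule
used to select g, and the function names, are assumptions made for
illustration; the hardware may select g differently.

```python
def digit_value(digits, base=8):
    """Value of a most-significant-first list of signed digits."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def online_add(x_digits, z_digits, base=8):
    """Right directed digit online addition with an online delay of one.

    Operands are lists of signed digits, most significant digit first.
    The result has one more digit than the operands.
    """
    if len(x_digits) != len(z_digits):
        raise ValueError("operands must have the same number of digits")
    out = []
    h_prev = 0                      # interim digit register, reset to 0
    for x, z in zip(x_digits, z_digits):
        s = x + z
        # Assumed selection rule: keep |h| <= base - 2.
        if s >= base - 1:
            g = 1
        elif s <= -(base - 1):
            g = -1
        else:
            g = 0
        h = s - base * g
        out.append(h_prev + g)      # digit j-1 is emitted when digit j arrives
        h_prev = h
    out.append(h_prev)              # final flush with transfer g = 0
    return out


total = online_add([3, 3, 4, 4], [3, 4, 3, 4])
# total == [0, 7, 0, 0, 0], i.e. 7000 in octal with a leading zero

# The three redundant representations of decimal 28 all evaluate to 28.
reps = [[0, 0, 3, 4], [0, 0, 4, -4], [0, 1, -4, -4]]
values = [digit_value(r) for r in reps]
```

Each output digit depends only on the current and previous input digits, so
no carry ever propagates further than one position, regardless of operand
length.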

The  digit  online  summation  elements  are  interconnected  as  indicated  in  Figure  5 
to  form  a  summation  component. 


Figure 5. Digit Online Summation Component

It should be noted that the manner in which the input data is supplied to the
summation component has changed because of the use of digit online elements.
Figure 6 illustrates how the input data is now supplied to the summation
component.

[Figure 6 diagram: the digits $(x_{i,j})_k$ of each input row are supplied
serially, with each successive row delayed relative to the row above it]

Figure 6. Digit Skewed Data

Note that the elements of the input array and the result array are now
supplied and generated in a digit skewed as well as element skewed manner.

The Scaling Component

The next component we will consider is the scaling component. The
mathematical operation performed by this component is given by $Z = DX$,
where

$D = [d_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_0,$ and $i \ne j \Rightarrow d_{i,j} = 0,$

$X = [x_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_1,$

and

$Z = [z_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_1.$

For the DFT problem, the elements along the major diagonal of $D$ are
predefined integer constants. As with the summation component, we will make
use of this fact by building them into the scaling component itself rather
than supplying them to the component as the operation is performed. This
allows us to reduce the circuit complexity, the number of input/output
connections, and, hence, the size of the component. The scaling component
will be constructed by interconnecting only one type of element. The area of
the scaling component is $O(n_0)$ and depends only on $n_0$ and the size of
the elements. The area of each element depends only on the precision p of the
integer values being scaled. This element can be represented as shown in
Figure 7.

X_in -> | (Y) | -> Z_out

where

$Z_{out} = Y X_{in}$

Figure 7. Scaling Element

These elements are interconnected as indicated in Figure 8 to form a scaling
component.

Figure 8. Scaling Component


Again, because of the area difficulties discussed with respect to the
summation component, we elect to use digit online arithmetic. This effect is
even more dramatic in the scaling component because of the size of a full
precision integer multiplier. However, while addition and subtraction can be
done equally well left or right directed, multiplication is best done right
directed. This occurs because of the following observations. The product of
two p digit integer numbers is a 2p digit integer. In left directed
processing the p digits of each input are supplied in least significant digit
to most significant digit order. More importantly, the 2p digits of the
result are generated in a right to left order. The opposite holds for right
directed arithmetic. Hence, if we use left directed arithmetic, we would have
only the p least significant digits of the result after supplying the p
digits of each input. However, we are interested in the p most significant
digits of the result. Hence, we could either clock the multiplier p more
times or use a two stage multiplier (the first part computes the p least
significant digits which are, unfortunately, then discarded; the second part
computes the p most significant digits). Two stages must be used so that we
may keep up with the flow of input digits as they are supplied to the
component. Neither of these two options is very palatable. However, for right
directed arithmetic, the p most significant digits of the

Figure 10. Digit Online Scaling Component


Note that each element holds one digit of the appropriate integer operand.
Thus, each element contains a digit multiplier to form $y\,x_{in}$, as well
as a digit online adder. As with the summation component, it should be noted
that the manner in which the input data is supplied to the scaling component
has changed because of the use of digit online elements. However, the
input/output requirements of the two are identical. Hence, the output of a
summation component can be connected (with respect to logical considerations)
directly to the input of a scaling component and vice versa.
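
The argument for right directed multiplication rests on which half of the 2p
digit product becomes available first. The sketch below splits a product into
its p most significant and p least significant digits. It uses conventional
(non-redundant) octal digits as a simplification, and the function names and
operand values are illustrative assumptions.

```python
def to_digits(value, count, base=8):
    """Return `count` conventional digits of value, most significant first."""
    digits = []
    for _ in range(count):
        digits.append(value % base)
        value //= base
    return digits[::-1]


def product_halves(a, b, p, base=8):
    """Split the 2p digit product a*b into (high p digits, low p digits)."""
    full = to_digits(a * b, 2 * p, base)
    return full[:p], full[p:]


# Two four digit octal operands.
high, low = product_halves(0o3344, 0o3434, 4)
# high == [1, 4, 1, 7], low == [6, 3, 6, 0]
```

A left directed multiplier delivers the digits of `low` during the first p
clocks, starting from the last entry, and only then reaches `high`. A right
directed multiplier delivers `high` first, which is the half the scaling
component needs.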

The Transpose Component

The last major component we will consider is the transpose component. Since
this component performs no arithmetic operations on the data supplied to it,
it is in some ways the simplest. However, since its data movement
requirements are the most general, it is in other ways the most complicated.
The mathematical operation performed by this component is $Z = X^T$, where

$X = [x_{i,j}],\ 0 \le i < n_1,\ 0 \le j < n_0,$

and

$Z = [z_{i,j}],\ 0 \le i < n_0,\ 0 \le j < n_1.$

The transpose component will be constructed by interconnecting elements of
only one type (storage) which have a rectangular shape. The area of the
transpose component is $O(n_1 n_0)$ and depends only on $n_1$, $n_0$, and the
size of the storage element. The area of the element depends on the precision
p of the integer value being stored. Each of these elements can be
represented as shown in Figure 11.

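[Figure 11 diagram: a storage element T with data input $X_{in}$, data output
$Z_{out}$, and a control input C that selects between two cases, one for a
condition on C and one for "otherwise"]

Figure 11. Storage Element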

These elements are interconnected as indicated in Figure 12 to form a
transpose component.

Figure 12. Transpose Component


By transferring $X$ into the transpose component with the control input C in
one state and transferring $Z$ out with C in the other state, the transpose
can be performed. Note that the elements of the input array and the result
array are not supplied and generated in an element skewed manner. Hence, the
input/output characteristics of this component are not compatible with the
other two components. This problem can be corrected by increasing the
hardware area and cost, but a more desirable solution to this problem is
needed.
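
The load-then-unload behavior of the transpose component can be modeled
functionally as follows. The sketch assumes two phases, one per state of the
control input: rows of $X$ are written into a grid of storage elements, then
columns of the grid are read out as rows of $Z$. Timing and skew are not
modeled, and the function name is hypothetical.

```python
def transpose_component(X):
    """Functional model of Z = X^T built from a grid of storage elements."""
    rows, cols = len(X), len(X[0])

    # Load phase (control input in its first state): rows are shifted in.
    store = [list(row) for row in X]

    # Unload phase (control input in its second state): columns are
    # shifted out, each one becoming a row of the result.
    Z = []
    for c in range(cols):
        Z.append([store[r][c] for r in range(rows)])
    return Z


Z_t = transpose_component([[1, 2, 3], [4, 5, 6]])
# Z_t == [[1, 4], [2, 5], [3, 6]]
```

The whole input must be stored before the first output row is complete, which
is one way to see why this component does not share the skewed streaming
interface of the summation and scaling components.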


CONCLUSIONS 


An expanded VLSI architecture for solving the DFT was presented. Expanded
architectures are constructed by deriving algorithms which trade many of one
type of operation, like addition, for regularity of data movement. Three
operational components were described: a summation component, a scaling
component, and a transpose component, each of which can be implemented in one
(or a few) VLSI chips. Using the same expansion techniques, we are presently
investigating other problem domains for which optimal expanded architectures
can be derived. The basic components may have to be modified and augmented.
However, the goals of regularity of data movement, speed, and being able to
implement each component in one (or a few) chips remain the same.